Rare variant analyses validate known ALS genes in a multi-ethnic population and identifies ANTXR2 as a candidate in PLS

Background Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease affecting over 300,000 people worldwide. It is characterized by the progressive decline of the nervous system that leads to the weakening of muscles which impacts physical function. Approximately, 15% of individuals diagnosed with ALS have a known genetic variant that contributes to their disease. As therapies that slow or prevent symptoms continue to develop, such as antisense oligonucleotides, it is important to discover novel genes that could be targets for treatment. Additionally, as cohorts continue to grow, performing analyses in ALS subtypes, such as primary lateral sclerosis (PLS), becomes possible due to an increase in power. These analyses could highlight novel pathways in disease manifestation. Methods Building on our previous discoveries using rare variant association analyses, we conducted rare variant burden testing on a substantially larger multi-ethnic cohort of 6,970 ALS patients, 166 PLS patients, and 22,524 controls. We used intolerant domain percentiles based on sub-region Residual Variation Intolerance Score (subRVIS) that have been described previously in conjunction with gene based collapsing approaches to conduct burden testing to identify genes that associate with ALS and PLS. Results A gene based collapsing model showed significant associations with SOD1, TARDBP, and TBK1 (OR = 19.18, p = 3.67 × 10–39; OR = 4.73, p = 2 × 10–10; OR = 2.3, p = 7.49 × 10–9, respectively). These genes have been previously associated with ALS. Additionally, a significant novel control enriched gene, ALKBH3 (p = 4.88 × 10–7), was protective for ALS in this model. An intolerant domain-based collapsing model showed a significant improvement in identifying regions in TARDBP that associated with ALS (OR = 10.08, p = 3.62 × 10–16). Our PLS protein truncating variant collapsing analysis demonstrated significant case enrichment in ANTXR2 (p = 8.38 × 10–6). Conclusions In a large multi-ethnic cohort of 6,970 ALS patients, collapsing analyses validated known ALS genes and identified a novel potentially protective gene, ALKBH3. A first-ever analysis in 166 patients with PLS found a candidate association with loss-of-function mutations in ANTXR2. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-024-10538-1.


Background
Amyotrophic lateral sclerosis (ALS) is a rare neurodegenerative disease characterized by the progressive loss of upper motor neurons in the cortex and lower motor neurons of the brainstem and spinal cord.Even with FDAapproved disease modifying medication and palliation by artificial nutrition and ventilation, the prognosis is poor and death from accumulating paralysis occurs a median of 32 months after symptoms first manifest [1].Over the last 30 years, genetic study of the 5-10% of ALS patients with family history [2,3] have securely implicated ~ 20 monogenic causes and showed possible association to a similar number of genes (https:// clini calge nome.org/ affil iation/ 40096/).Causative mutations in the most prevalent ALS genes (C9ORF72, SOD1, TARDBP, and FUS) explain ~ 70% of familial ALS [4,5].Due in part to incomplete penetrance, 10% of simplex ALS cases, which show no known family history, also carry mutations in these same genes [6].
A paucity of unsolved ALS pedigrees for family studies has intersected with falling sequencing costs for largescale sequencing to allow gene discovery studies based on rare variant burden or collapsing methods on cohorts using predominantly simplex patients.Since our group first used this approach to implicate TBK1 and NEK1, others have also identified DNAJC7, TUBA4A and several candidates [6][7][8][9].These analyses utilized the entire gene or recognizable functional domains as the regions for burden testing [6,7] and were restricted to cohorts with European ancestry, or with less than 5% non-European ALS cases.Recognizing that power for discovery could be improved by a) increasing case and control numbers, b) diversifying the ancestries of participants, and c) collapsing on domains known to be intolerant of variation, we conducted both standard gene and intolerant domain-based collapsing analyses on 6,970 multiethnic ALS cases and ancestry-matched controls.
Primary lateral sclerosis (PLS) is also a neurodegenerative disease of motor neurons with clinical features, neuropathology, and some genetics that overlaps with ALS [10][11][12].PLS is nearly always simplex and 20 times rarer than ALS [13].Because large-scale sequencing studies of ALS often include PLS patients, we were able to conduct a gene-based collapsing analysis in 166 PLS multi-ethnic cases and ancestry matched controls.

Whole exome and genome sequencing
DNA sequencing was performed at Columbia University, the New York Genome Center, Duke University, McGill University, Stanford University, HudsonAlpha, and University of Massachusetts, Worcester.Kits used to conduct whole-exome capture are as follows: Agilent All Exon kits (50 MB, 65 MB, and CRE), Nimblegen SeqCap EZ Exome Enrichment kits (V2.0,V3.0, VCRome, and MedExome), and IDT Exome Enrichment panel.There were 2,185 participants with ALS who were sequenced using Nimblegen SeqCap EZ Exome Enrichment kits and 51 who were sequenced using the IDT Exome Enrichment panel (Supplemental Table 1).While 1,272 controls were evaluated using the Aligent All Exon kits, 8,498 with the IDT Exome Enrichment panel, and 11,201 with the Nimblegen SeqCap EZ Exome Enrichment kits.Sequencing was performed using Illumina GAIIx, HiSeq 2000, HiSeq 2500, and NovaSeq 6000 sequencers according to standard protocols.Whole genome sequencing was conducted at the New York Genome Center and in-house at the IGM.Sample-level BAM files were transferred from the New York Genome Center to the IGM (n = 3,418).An additional 1,316 genomes were processed by the IGM.There were 1,553 genomes in our control cohort (Supplemental Table 1).Data were aligned to the human reference genome (NCBI Build 37) using DRAGEN (Edico Genome, San Diego, CA, USA).Picard (http:// picard.sourc eforge.net) was used to remove duplicate reads and to process lane-level BAM files to create a sample-level BAM file.GATK was used to recalibrate base quality scores, realign around indels, and call variants utilizing the Best Practices recommendations v3.6 [14].Variants were annotated using ClinEff and the Analysis Tool for Annotated Variants (ATAV), an in-house IGM annotation tool [15].Variants were annotated with the Genome Aggregation Database (gnomAD) v2.1 frequencies, regional-intolerance metrics, and the clinical annotations by the Human Gene Mutation Database (HGMD), Clin-Var, and Online Mendelian Inheritance in Man (OMIM).Exonic regions were retained for downstream statistical analyses.

Sample and variant quality control
Samples reporting > 2% contamination according to veri-fyBamID [16] and those with consensus coding sequence (CCDS release 20) < 90% were excluded from these analyses.KING [17] was used to test for relatedness.Only unrelated (up to second-degree) individuals were included in these analyses.For related pairs, samples were chosen to prefer cases.Samples where X:Y coverage ratios did not match expected sex were excluded.
Only variants within the CCDS or the 2 bp canonical sites were included in these analyses.These variants were also required to have a quality score of at least 50, a quality by depth score of at least 5, genotype quality score of at least 20, read position rank sum of at least − 3, mapping quality score of at least 40, mapping quality rank sum greater than − 10, and a minimum coverage of at least 10.SNVs had a maximum Fisher's strand bias of 60, while indels had a maximum of 200.For heterozygous genotypes, the alternative allele ratio was required to be greater than or equal to 30%.Only variants with the GATK Variant Quality Score Recalibration filter "PASS", "VQSRTrancheSNP90.00to99.00",or "VQSRTrancheSNP99.00to99.90"were included.Variants were excluded if they were marked by EVS, ExAC, or gnomAD as being failures (http:// evs.gs.washington.edu/EVS/).

Clustering, ancestry, and coverage harmonization
A neural network pre-trained on samples of known ancestry was used to calculate probability estimates for six ancestry groups (African, East Asian, European, Hispanic, Middle Eastern, and South Asian).Methods for characterizing samples into clusters has been previously described [18].
To ensure balanced sequencing coverage of evaluated sites between cases and controls, we imposed a statistical test of independence between the case/control status and coverage as previously described [19].Sites were removed where the absolute difference in percentages of cases and controls with at least 10 × coverage was greater than 7%.Samples were then pruned using this method on a cluster-by-cluster basis.Through this approach, approximately 7-11% were removed.Clusters with less than 5 participants were not included in these analyses, thereby removing 6 participants with PLS but none with ALS.

Variant-level and gene-level statistical analysis
The models that were used to test for associations of nonsynonymous coding or canonical splice variants with outcome included variants with MAF < 0.1% for each population represented in gnomad and internal AF of < 0.1%.Models tested were a standard gene-unit collapsing analysis, and a domain-unit analysis.The models used for these analyses were previously described [7].A domain-based approach utilizing sub-region Residual Variation Intolerance Score (subRVIS) domain percentage [7] with a threshold of 25 was also used to evaluate case enrichment of rare variants.The full list of 18,653 CCDS genes was analyzed for each model.Genes with at least one qualifying variant were included for analyses.As we are meta analyzing across clusters an exact 2-sided Cochran-Mantel-Haenszel test was used (using the statistical package in R v3.6).Study-wide significance was determined by accounting for 6 nonsynonymous models-multiplicity-adjusted significance threshold α = 4.9 × 10 -7 (Supplemental Table 2).Model inflation was calculated using empirical (permutation-based) expected probability distributions as described by Povysil and colleagues [18].

ALS and PLS rare variant burden testing
We conducted both standard gene and intolerant domainbased collapsing analyses on 6,970 multi-ethnic ALS cases (87% European) and 22,534 ancestry-matched controls.Standard gene collapsing analyses identified case enrichment of rare variants (minor allele frequency of 0.001) in an ALS cohort with 12 sub-population groups (Supplemental Fig. 1A) that correspond to ancestry-based clusters (Supplemental Fig. 1B; Supplemental Table 3).Analyses were conducted on clusters with at least 3 cases.Controls were drawn from individuals sequenced for phenotypes/ diseases with no known association with ALS (Supplemental Table 4).As expected, a negative control analysis for rare synonymous variants found no case-enrichment (Supplemental Fig. 2).Because gene-based collapsing considers variation across the entire gene, regions that are tolerant of variation could swamp case-enrichment signals originating from regions that are intolerant to variation [7].To overcome this limitation, we conducted rare variant collapsing on domains that are intolerant to variation as defined as a subRVIS domain score threshold of 25, a cutoff based on threshold testing.
As large-scale sequencing studies of ALS often include PLS patients, we were able to conduct a genebased collapsing analysis in 166 PLS multi-ethnic cases (88% European) and 17,695 ancestry matched controls (Supplemental Fig. 3; Supplemental Tables 5).
We expected the study would be underpowered for securely implicating causative genes but used this as an opportunity to generate candidates for future study.

ALS gene set enrichment analyses
An ALS gene set enrichment analysis was conducted using the gene strength association list outlined in Table 1.We utilized the qualifying variants that were associated with ALS in each gene set category and used the exact two-sided CMH test to analyze burden of ALS genes defined by gene set.These lists were curated using data published by Gregory and colleagues [20] and the Clinical Genome Resource (ClinGen).As outlined, "ALS Definite" genes were found to have ample published replication evidence, while ' ALS Plus' genes had some replication data and/or functional evidence for an association with ALS.However, ' ALS Moderate genes, required additional replication analyses and/or functional data, and ' ALS Limited' genes were genes that overlapped with ALS phenotypically.

Rare variant burden testing
Collapsing analysis of all rare functional variants (missense and protein truncating variants) (Supplemental Table 2) found genome-wide and study-wide significant (p < 4.9 × 10 -7 ) case-enrichment for known ALS genes SOD1, TARDBP, TBK1 (OR = 19.18,p = 3.67 × 10 -39 ; OR = 4.73, p = 2 × 10 -10 ; OR = 2.3, p = 7.49 × 10 -9 , respectively) and control-enrichment for ALKBH3 (OR = 0.26, p = 4.88 × 10 -7 ) (Fig. 1A; Supplemental Data).Although SOD1, TBK1 and TARDBP are definitive ALS genes, we were intrigued by the identification of control-enriched ALKBH3.Control-enrichment was not explained by sequencing methodology, ancestry cluster, or specific phenotype/disease population within the control cohort.Because ALKBH3 plays a role in DNA repair [21], a mechanism increasingly implicated in ALS pathogenesis [22], we attempted to replicate this novel association by analyzing summary statistics from the Project MinE cohort, which is similar in size to ours [23].None of the available models focused on variation as rare as in our analyses, but at a higher minor allele frequency (MAF) for qualifying variants (0.005), a minor degree of control-enrichment was in fact observed (OR = 0.56, p = 3.96 × 10 -4 ).This raises the possibility that rare missense and protein truncating variants (PTVs) in ALKBH3 could protect from ALS, a finding that requires validation in large cohorts.
Intolerant domain analyses implicated the same three known ALS genes (SOD1, TARDBP, and TBK1 at OR = 20.63,p = 1.68 × 10 -38 ; OR = 10.08,p = 3.62 × 10 -16 ; and OR = 3.15, p = 8.38 × 10 -11 , respectively) (Fig. 1B; Supplemental Data).The intolerant domain analysis did not improve over the gene-based analysis for SOD1 or TBK1 (Fig. 2; Fig. 3) but doubled the odds ratio and significantly lowered the p-value obtained for TARDBP.The improvement of the intolerant domain model (Fig. 1C,  1D) stemmed from a significant drop (one-tailed z-score p = 0.031) in the number of qualifying variants found in controls dispersed across tolerant regions, while highlighting qualifying variants in ALS cases predominantly in the intolerant C-terminal region.
Although most models showed no significant genes, the dominant PTV model showed significant case enrichment for ANTXR2 (OR = 174.57,p = 8.38 × 10 -6 ) (Fig. 4; Supplemental Table 6; Supplemental Data), a gene associated with brain connectivity changes and Alzheimer's disease [24].Currently, there are no additional large sequencing studies of PLS in which we could attempt replication.

ALS gene set enrichment analyses
A gene set enrichment analysis of genes that were defined as ' ALS Definite' were significantly associated with ALS for all dominant models, including PTV only (p < 1.49 × 10 -72 ), PTV & Missense (p < 7.93 × 10 -24 ), and Missense only (p < 8.27 × 10 -28 ) (Fig. 5).The synonymous model, which served as a control, showed no association (p = 0.5) between these genes and ALS.Genes that are moderately associated with ALS, ' ALS Moderate, showed significant enrichment of rare variants for the PTV & Missense (p < 7.9 × 10 -6 ), as well as the Missense only (p < 8.18 × 10 -5 ) models.The group of genes that were described as having limited evidence, ' ALS Limited' , showed a significant association with rare variants and ALS for the PTV only model (p = 0.032).For all other models, rare variants in these genes were not significantly associated with ALS.An analysis of genes that are characterized as ' ALS Plus' showed no significant association of rare variants with ALS for the 4 models that were analyzed.

Burden testing
Conducting genic and intolerant domain based collapsing analyses in a large multi-ethnic population provides insight into novel and established biological mechanisms in disease manifestations.Additionally, analyzing specific disease subtypes can capture critical disease pathways that could be targets for clinical intervention.Here we show, that performing collapsing analyses in multi-ethnic populations and in disease subtypes found novel genetic associations in individuals diagnosed with ALS and PLS.These analyses implicated ALS genes that have previously been identified (SOD1, TARDBP, and TBK1).We also identified ALKBH3 as a potentially protective gene that warrants further study in additional cohorts.In addition, we conducted the first rare variant collapsing analysis in PLS, identifying PTVs in ANTXR2.This gene will need to be investigated further in larger PLS cohorts or in targeted functional analyses.Lastly, gene set enrichment analyses provide evidence that genes known to be associated with ALS show strong evidence to have a rare variant burden especially for protein truncating variants.

ALKBH3 associates with ALS
We found that genic collapsing analyses of individuals diagnosed with ALS identified known risk genes (SOD1, TARDBP, and TBK1) and a novel protective gene (ALKBH3).ALKBH3 encodes for AlkB homolog 3, Alpha-Ketoglutarate Dependent Dioxygenase which protects against the cytotoxicity of methylating agents by repair of the specific DNA lesions [25][26][27].ALKBH3 potentially acts as a putative hyperactive promotor to suppress transcription associated DNA damage of highly expressed genes [28].Genes that play a role in DNA repair and DNA damage response such as TAR-DBP, FUS, and NEK1 [29][30][31][32] are known to play a role in ALS potentially through neuronal death pathways.

ANTXR2 associated with PLS
Genic collapsing analyses of protein truncating variants on individuals with PLS identified a suggestive gene (ANTXR2).ANTXR2 encodes a receptor for anthrax toxin that may be involved in extracellular matrix adhesion.Variants in this gene have been associated with hyaline fibromatosis [33,34], and has been shown to play a role in angiogenesis [35].This finding adds to the number angiogenic genes that have been implicated in ALS including VEGF and ANG [36].
While we identified a potentially important gene that is associated with PLS, we were limited in our sample size and will therefore need additional cohorts or functional studies to further investigate this finding.Additionally, there are potentially more ALS subtypes that could be investigated to better understand this heterogeneous disease.Lastly, unknown confounders could be contributing to the signal that are found in these association analyses.

Conclusions
In summary, we performed the largest rare variant analyses of a multi-ethnic population of patients with ALS to date.Our analysis did not identify new ALS risk genes but demonstrated that collapsing models informed by regions of intolerance can be useful for identifying genes where diseaseassociated variation is limited to regions with low background variation.This analysis also confirmed the association of the C-terminal domain of TARDBP.We also identified ALKBH3 as a potentially protective gene that warrants further study in additional and larger cohorts.Finally, we conducted the first rare variant collapsing analysis in PLS, identifying PTVs in ANTXR2 as a candidate disease gene.This association and potential mechanisms for PTVs in this gene will need to be investigated further in larger PLS cohorts.
It is important to note that this analysis doubled the number of ALS cases and quadrupled the number of controls from our first study [6] but remained underpowered for the identification of new ALS genes.A recently published rare variant burden analysis with a similar number of ALS cases did not identify new genes [23] either, emphasizing the need for increasingly large genomically characterized ALS cohorts, especially in non-European populations.

Availablity of data and materials
All summary data generated during this study are included in this published article and its supplementary information files.

Fig. 1
Fig. 1 Q-Q plots of gene-and domain-level collapsing of ALL functional coding variants in ALS cohort.A The results for a standard gene-level collapsing of 6,970 ALS cases and 22,524 controls.P-values were generated using an exact two-sided Cochran-Mantel-Haenszel (CMH) by gene by cluster.The genes with the top associations that achieved study-wide significance of p < 4.9 × 10 -7 (SOD1 (OR = 19.18),TARDBP (OR = 4.73), TBK1 (OR = 2.3), and ALKBH3 (OR = 0.26)) are labeled.SOD1, TARDBP, TBK1 have been previously implicated in rare variant association studies of ALS.Yellow and green lines indicate the 2.5th and 97.5.th percentile of expected p-values, respectively.B The results for the domain-based collapsing restricting qualifying variants to those with subRVIS domain percentage score < 25 of 6,970 cases and 22,524 controls.P-values were generated using an exact two-sided Cochran-Mantel-Haenszel (CMH) by gene by cluster.The genes with the top associations (SOD1 (OR = 20.63),TARDBP (OR = 10.08), and TBK1 (OR = 3.15)) are labeled.C Standard gene-level collapsing model showed 44 qualifying variants in cases (red circles) and 31 in controls (blue circles) for TARDBP (D) subRVIS domain collapsing improved association by removing control variants (cases = 43; controls = 15).Regions with subRVIS domain percentage below 25 are highlighted in orange while those above this threshold are highlighted in blue.A one tailed z-score showed that there were significantly less controls in the intolerant domain as indicated by subRVIS domain percentage score < 25 (p = 0.031)

Fig. 2 Fig. 3 Fig. 4 QFig. 5
Fig. 2 Plot of gene-and domain-level collapsing of ALL SOD1 functional coding variants.Standard gene-level collapsing model showed 93 qualifying variants in cases (red circles) and 18 in controls (blue circles) for SOD1.subRVIS domain collapsing improved association by removing control variants (cases = 90; controls = 16).Regions with subRVIS domain percentage below 25 are highlighted in orange while those above this threshold are highlighted in blue.However, a one tailed z-score showed that the differences in the number of controls in the intolerant domain was not significantly lower than those in the entire gene as indicated by subRVIS domain percentage score < 25 (p = 0.4)

Table 1
ALS gene association strength